Suitability Analysis: Best ZCTA Within the Boston Area to Live Without a Car

Tufts University
Department of Urban and Environmental Policy and Planning
UEP-239: Geospatial Programming with Python
Final Project
By: Justina Cheng
May 2021

The purpose of this suitability analysis is to find the best Zip Code Tabulation Area (ZCTA) within the Boston Area for a Tufts University student and a Boston University student to live together without a car (i.e. somewhere equally convenient for both). Though the self-serving nature of the study is specific to students of Tufts and BU, a similar workflow can be conducted as a general analysis of the Boston Area (by omitting the network analysis between Tufts and BU) or by specifying different locations of interest (e.g. HubSpot's Cambridge office and the Museum of Fine Arts). The workflow can also be applied to any city or region, provided data is available.

The study is structured as follows:

  1. Import dependencies
    • Necessary packages can be found in the environment.yml file.
  2. Create a base map of the region.
  3. Import and view mass transit stops and routes.
  4. Limit the study area to the extent of rapid transit lines.
  5. Conduct a shortest path network analysis between Tufts and BU to use as visual reference points for areas that are convenient across indicators for both a Tufts student and a BU student.
  6. Analyze and reclassify six indicators necessary for living without a car:
    • Mass transit accessibility using density of stops.
    • Rent affordability using median rental prices for a two- or three-bedroom home.
    • Necessities:
      • Food availability using density of food establishments (e.g. supermarkets, restaurants, etc.).
      • Health services availability using density of healthcare establishments (e.g. doctors' offices, clinics, etc.).
      • Public services availability using density of public services (e.g. post offices, fire stations, etc.).
        • Leisure potential using density of leisure establishments (e.g. parks, sports facilities, museums, etc.).
  7. Calculate a weighted suitability index based on individual importance of the six indicators.
  8. Analyze results of the suitability index, compare to the locations of and shortest paths between Tufts and BU, and determine which ZCTAs are best suited.

References and data used an be found in the README file.


Import Dependencies


Create and View Base Map of Boston Region Zip Code Tabulation Areas (ZCTAs)

To create a GeoDataFrame of the Boston Region ZCTAs, the following steps were used:

  1. Massachusetts outline with detailed coastline was imported from MassGIS as a GeoDataFrame.
  2. Massachusetts ZCTAs were imported from the Census Bureau as a GeoDataFrame.
  3. The outline and ZCTAs GeoDataFrames were converted to the coordinate reference system (CRS) for the Massachusetts Mainland EPSG 6491.
  4. Boundaries for the Boston Region Metropolitan Planning Organization (MPO) were imported from MassDOT as a GeoDataFrame, and the CRS was converted to EPSG 6491.
  5. The Boston Region was extracted from the MPO.
  6. Massachusetts ZCTAs within the Boston Region were extracted using the centroid of the ZCTAs.
  7. Function convert_n_clip was created to convert a GDF to the CRS of another GDF and clip to the other's extent.
  8. Function read_n_clip was created to read in a shapefile and use convert_n_clip to convert it to the CRS of another GDF and clip to the other's extent.
  9. Massachusetts Surface Water data from MassGIS was processed with read_n_clip with the extent of Boston ZCTAs.

Massachusetts Coastline

Massachusetts ZCTAs

Boston Region Metropolitan Planning Organization (MPO)

Define Functions convert_n_clip and read_n_clip

These two functions will be used to read in, convert the CRS, and clip the extent of GDFs throughout the analysis.

convert_n_clip takes two GeoDataFrames (GDF): one to process (gdf) and one whose extent will be used to clip. The function converts the coordinate reference system (CRS) of the original GDF and clips it to the extent of the extent GDF.

read_n_clip takes a filepath for a shapefile and a GDF whose extent will be used to clip the shapefile. The function reads in the shapefile and uses convert_n_clip to convert the coordinate reference system (CRS) of the original GDF and clip it to the extent of the extent GDF.

Boston Region Surface Water

Though not used for the analysis, surface water in the Boston region is a helpful landmark and reference point for orienting one to the map or for figuring out why some data may appear strange (e.g. why there are no rental points in an area).

Surface water in this analysis is added to some maps for reference, where appropriate.


Import Mass Transit Stops and Routes

Mass/public transit in the Boston area is primarily governed by the Massachusetts Bay Transportation Authority (MBTA) and is available in the form of buses, rapid transit (i.e. the T), and the commuter rail (which is used for traveling to and from the greater metro area). MBTA bus, MBTA T, and MBTA Commuter Rail stops and routes data were all obtained from MassGIS.

MBTA Bus Stops and Routes

Stops were read in and processed with read_n_clip. However, because routes often go over or under water, routes were read in with gpd.read_file and converted to the proper CRS.

MBTA Rapid Transit (T) Stops and Routes

Stops were read in and processed with read_n_clip. However, because routes often go over or under water, routes were read in with gpd.read_file and converted to the proper CRS.

MBTA Commuter Rail Stops and Routes

Stops were read in and processed with read_n_clip. Routes were originally read in with gpd.read_file and converted to the proper CRS to preserve areas over or under water, but the extent greatly exceeded that of the Boston Region MPO. Therefore, routes were read in and processed with read_n_clip.

Map Mass Transit with Base Map


Limit Study Area to Extent of Rapid Transit

Judging by the vast extent of the commuter rail, and the density of bus and T stops, the outer ZCTAs within the Boston Region MPO are more untenable for frequent commutes to work or to campus. A mass transit density analysis that includes the outer ZCTAs will skew those within the range of the bus and the T, so they have been excluded from the study by limiting the extent to that of the T.

Limit with Rectangular Bounds

To limit the study area to the extent of the T for a more realistic comparison of ZCTAs, the following steps were used:

  1. Extract the rectangular bounds of MBTA Rapid Transit (T) stops.
  2. Create a bounding box with shapely.geometry.box.
  3. Add a buffer to the bounding box and store as a new extent.
  4. Extract Boston Area ZCTAs whose centroids are within the extent.

Limit with Convex Hull

There are a number of ZCTAs that are within the defined rectangular bounds but are not close to a T stop. The extent was further limited to the convex hull of T stops with the following steps:

  1. Create a convex hull of T stops with unary_union.convex_hull.
  2. Add a buffer to the convex hull and store as a new extent.
  3. Extract Boston Area ZCTAs that intersect with the extent.

Clip All Relevant GDFs

The following GDFs were clipped to the new rt_zcta extent:

Because bos_rt_node was used to create the extent and bos_rt_route connects all T stops, bos_rt_route does not need to be clipped.


Tufts University and Boston University

To find the locations of Tufts University and Boston University (BU), the Massachusetts Colleges and Universities shapefile was processed with read_n_clip to read the shapefile and clip it to the extent of boston_zcta. Tufts and BU were then extracted into a GeoDataGrame.

Though using NetworkX to conduct a network analysis on mass transit nodes and routes was untenable for this project, a network analysis was conducted on the driveable network within the extent to find shortest paths between Tufts and BU. These paths are intended to be used as visual reference points for areas that are convenient across indicators for both a Tufts student and a BU student.

Find Locations for Tufts and BU

Shortest Path Between Tufts and BU

A network analysis was conducted to find the shortest path between Tufts and BU with the following steps:

  1. Convert ZCTA extent to a polygon using unary_union with a small buffer to account for any nodes or edges over water.
  2. Create the street network in the extent with ox.graph_from_polygon.
  3. Reproject the graph and extract nodes and edges as GDFs.
  4. Convert Tufts and BU GDF to projected CRS.
  5. Find nearest nodes of Tufts and BU with ox.get_nearest_node.
  6. Find shortest paths in both directions by length and by time with nx.shortest_path.
  7. Create a GDF with the paths and the path geometry.
  8. Convert the paths GDF to the CRS of rt_zcta to use later in the study.

Fetch Network Data with NetworkX's graph_from_polygon

Reproject Graph and Extract Nodes and Edges

Find Nearest Nodes for Tufts and BU and Shortest Paths

Convert Routes to GeoDataFrames and Analyze

Convert CRS of Paths GDF to Map with Indicators

The paths GDF was converted to use as visual reference points for areas that are convenient across indicators for both a Tufts student and a BU student. Though they will not match bus routes, they can be used as a very rough proxy for areas that buses would be convenient. Locations near the shortest paths are preferable to ones farther from the paths.


Mass Transit Accessibility

As the study's premise is living without a car, it is crucial that the home's location be easily accessible by mass transit. Density of transit stops was selected as an indicator for convenience of mass transit. While the routes are also important to consider, the stops are the on-off points to transit lines and necessary to accessing the transit systems.

The density of mass transit stops was calculated with the following steps:

  1. Function count_records was created to count the number of records in a GDF within polygons of another GDF (e.g. number of bus stops within a ZCTA).
  2. count_records was used on the GDFs for bus stops, T stops, and train stops in the limited rt_zcta extent.
  3. Function multimerge was created to merge multiple DataFrames on the same column or list of columns.
  4. multimerge was used to add all mass transit stop counts to rt_zcta
  5. Total stops and stop density were calculated for all ZCTAs and mapped.
  6. Mass transit density was reclassified into five classes with the following functions:
    1. Function quantiles was created to find quantile values based on quantile thresholds for a set of values. Quantile values divide the dataset into equally sized segments. The results of this function are then used to reclassify data.
    2. Function reclass_5 was created to reclassify a value into one of five classes given thresholds for reclassifications (e.g. those found with quantiles.
  7. Reclassified mass transit density was mapped and analyzed.

Count Public Transit Nodes per ZCTA

Define Function count_records

count_records takes a GeoDataFrame and counts the number of records within another GDF of polygons. It outputs a DataFrame with the specified polygon column values and counts column. The optional argument op for gpd.sjoin defaults to 'within' unless otherwise specified.

Count MBTA Bus Stops per ZCTA

Count MBTA T Stops per ZCTA

Count Commuter Rail Stops per ZCTA

Define Function multimerge

multimerge takes a base DataFrame and merges it with each DataFrame in a list of DataFrames on the specified column or list of columns and with the specified how. The function assumes the specified column(s) exist(s) across all DataFrames.

Calculate Mass Transit Density per ZCTA

The top value for nodes_density greatly exceeds the next value, despite having a low nodes_count, indicating it is an outlier. The ZCTA in question, ZCTA 02222, appears to contain only TD Garden and North Station. Though setting the nodes_density value for ZCTA 02222 to the median value to prevent skewing the analysis was contemplated, the decision was ultimately made to allow the outlier as reclassification would nearly negate the skew.

Reclassify Mass Transit Density

To reclassify indicators, quantile values need to be found for each indicator and used to reclassify values in roughly equal segments. The quantiles function was created to find any number of quantiles, while the reclass_5 function was created to reclassify indicators into five classes.

Define Functions quantiles and reclass_5

quantiles takes a DataFrame, a column name, and a list of quantile thresholds (between 0 and 1) and outputs a list of quantile values for the column.

reclass_5 takes a value and reclassifies it into 1 of 5 classes given a list of values for class thresholds and order preference.


Rent Affordability

Rent is a crucial expense for any non-homeowner and a large cost-burden, especially in an urban area. Rent affordability in this study was judged by the median rent for a two- or three-bedroom home in each ZCTA.

Data was obtained from Jeff Kaufman's Apartment Price Map, which scrapes data from Padmapper. September 2020 data was used because most leases in the area begin in September. Though the dataset does not cover the entire defined extent (53 of 62 ZCTAs are represented), it was the best source of point data found for the Boston Area.

Zillow data was also considered as a source, but it was much more limited in extent. Only 37 of the 62 ZCTAs in question were available in the Zillow dataset. Zillow analysis has been included as an appendix at the end of the study for those who are curious.

The following steps were used to analyze rental data:

  1. Read in CSV file with point data on rental listings.
  2. Conduct a spatial join to match listings with ZCTAs.
  3. Calculate statistics on rental prices, including len, min, max, median, mean, std.
  4. Reclassify median rent prices in quintiles.

The range of rental prices for 2-3 bedroom homes is very large but the vast majority are concentrated at the lower end. A colormap to map rental prices was accordingly chosen for the most variation at the lower end.

Conduct Spatial Join to Match Rents with ZCTAs

Calculate Statistics for Rent Prices

The highest median rental price is an outlier derived from 4 listings in ZCTA 02199, which is the Prudential Center. Similar to ZCTA 02222 in Mass Transit Density, the decision was made to allow the outlier as reclassification would nearly negate the skew.

Reclassify Rent Prices

According to this analysis, rents are more afforable farther from the city center. This median rent price map is nearly the inverse of the mass transit density map.

Though rental data was only available in 53 of the 62 ZCTAs, the remaining indicators will be analyzed for the entire extent. ZCTAs lacking rental data will be excluded in the overall index calculation at the end to ensure the analysis's fairness across ZCTAs.


Locations and Density of Necessities

Accessibility of necessities and amenities is crucial to living anywhere. For the purposes of this study, necessities were defined as follows:

Some data was imported from sources such as MassGIS while others were retrieved using OpenStreetMap and OSMnx's geometries_from_polygon function.

As these three indicators each has more than one dataset to base analysis on (unlike the preceding rental data), the function calc_density was created to automate density calculations based on counts of records in multiple datasets (e.g. density of food establishments per square kilometer in each ZCTA based on counts of groceries, prepared food, and farmers markets). calc_density can be found in the food analysis where it was first used.

Create Extent in Latitude-Longitude to Use With OSMnx

To use OSMnx geometries_from_polygon, a new polygon of the extent's convex hull was created in latitude-longitude coordinates by converting the T stops shapefile to EPSG:4326 for lat-long and creating a convex hull with unary_union.convex_hull.

Food Within Extent

Groceries and prepared food establishments within this study were found using OSMnx and the appropriate OSM tags.

Farmers markets were obtained from MassGIS.

Groceries

Prepared food

Farmers Markets

Define Function calc_density

This function was created to automate density calculations based on counts of records in multiple datasets (e.g. density of food establishments per square kilometer in each ZCTA based on groceries, prepared food, and farmers markets).

The purpose of this function is to calculate the density of a set of records in a GeoDataFrame of polygons. It takes a base GeoDataFrame and, using function multimerge, merges it with a list of GeoDataFrames on the specified column or columns in the list of on columns and with the specified how. The GDFs within the list are expected to have columns with counts, and those columns are given in a list. The function then adds together the counts in specified columns and calculates the density.

Calculate Food Density per ZCTA

The Downtown Boston area evidently has the most food establishments per square kilometer while outer ZCTAs have fewer. The small food-dense ZCTA near BU is ZCTA 02199, the Prudential Center.

Reclassify Food Density

After reclassification, the distribution of food-dense ZCTAs has evened out. The central area of the map still has the highest concentration of food density, but class 4 (having between 8 and 28 food establishments per sqkm) covers a large area as well.


Health Services Within Extent

Health services were defined as Community Health Centers and Hospitals from MassGIS and found with the appropriate OSM tags for healthcare and amenities, such as clinics and doctors offices. Social services facilities and veterinary servies were omitted.

Community Health Centers

Hospitals

Healthcare

Calculate Health Services Density per ZCTA

Reclassify Health Services Density

This reclassified health density map looks very similar to the reclassified food density map. Both food and health are most dense in the central part of the Boston area.


Public Services Within Extent

Public services were defined in terms of safety (fire and police) and availability of public resources (USPS Post Offices and libraries). Fire stations, police stations, and libraries were obtained from MassGIS. USPS data was obtained from USPS as a CSV of Destination Delivery Units (DDUs) in a specified map extent.

Fire Stations

Police Stations

USPS Post Offices

Libraries

Calculate Public Services Density per ZCTA

Reclassify Public Services Density

Once again, a similar trend is found for public services density as for food and health services density. The central part of the city has the most amenities per sqkm.


Locations and Density of Leisure Establishment

While necessities like food, health, and safety are crucial to living anywhere, living requires leisure as well--life cannot be all work and no play! Leisure features were found using OSMnx and analyzed with the same process as other indicators.

Leisure features were defined using OSM tags for leisure, which define leisure features as "places people go in their spare time". Additional tags were included for amenities, such as bars and theaters, and tourist sites, such as aquariums and museums.

Calculate Leisure Density

leisure_gdfs = [zcta_leisure_count] leisure_cols = ['leisure_count'] zcta_leisure2 = calc_density(rt_zcta, leisure_gdfs, ['ZCTA5CE00'], 'left', leisure_cols, 'leisure_count', 'leisure_density')

zcta_leisure2.sort_values(by='leisure_density', ascending=False).head()

The outlier for leisure density is ZCTA 02199, the Prudential Center.

Reclassify Leisure Density

Needless to say, the trend of the most amenities in the central area of Boston holds for leisure establsihments as well.


Calculate Weighted Suitability Index

Analyzing the reclassification maps of each indicator (mass transit, rent affordability, food establishment, health services, and public services), there is significant variation in desirable ZCTAs for each indicator. A weighted index was created to compare overall suitability across all indicators.

Add Reclassification Columns to ZCTAs

Calculate Weighted Index

A weighted index is highly subjective. Some may value availability of food over cost of rent, while others may value convenience of public transit over all else. Others still may value leisure above all else.

For the purposes of this study (ostensibly concerned with living as a student without a car), rent affordability was given the most weight at 35%, while transit was given 20% and food was given 15% (as transit can be used to go to more food-dense areas). Though health services and public services are categorized as necessities, leisure is sought after more often than either health or public services and was thus given 15%. Health and public services were therefore given 7.5% each.

Because rental data was only available in a segment of ZCTAs, NaN values were preserved across index calculations to ensure a fair comparison.

Map with Folium to View Interactively

While the above choropleth map is insightful, an interactive map is more helpful to view actual locations with reference points beyond Tufts and BU. Interactive maps were created by converting all relevant GDFs to latitude-longitude and plotting them with Folium.

Choropleth Map of Weighted Suitability

Map Transit Routes with Schools


Conclusions

The below DataFrame shows reclassification values for each of the top five weighted ZCTAs in the Boston Area.

Though ZCTAs 02215 and 02115 have the highest suitability score (3.950), they do not meaningfully intersect the shortest paths--they are conveniently near Boston University but not at all near Tufts University. They also have higher median rents: the rent reclassification scores of 2 were outweighed by 5s across all other indicators.

ZCTAs 02145 (Winter Hill area) or 02143 (Union Square area) appear to be the best places in the Boston Area for a Tufts University student and a Boston University student to live together without a car. They are tied for second in overall suitability (3.925) and are on or in between the shortest paths between Tufts and BU. The ultimate choice for housing thereby relies on whether the students value lower rents (02145 rent score is 4, 02143 rent score is 3) or density of mass transit, health services, and public services.


Appendix

Zillow Data Analysis

Zillow data was analyzed but ultimately not used for the following reasons:

The analysis was preserved in the Appendix as an example and to compare with Padmapper data for those interested.

Read In and Pre-Process National Zillow Data

Calculate Average Rent Across Years

Compared to the Padmapper rental data, the extent of Zillow data is lacking in scope. It may be helpful in other regions, but was not of much use for this study.